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Abstract 

Alzheimer's disease (AD), and its precursor state, mild cognitive impairment 
(MCI), continue to be widely studied. While there is no consensus on whether 
MCIs actually "convert" to AD, this concept is widely applied, allowing sta- 
tistical testing and machine learning methods to help identify early disease 
biomarkers and build models for predicting disease progression. Thus, the 
more important question is not whether MCIs convert, but what is the best 
such definition. We focus on automatic prognostication, nominally using only 
a baseline image brain scan, of whether an MCI individual will convert to AD 
within a multi-year period following the initial clinical visit. This is in fact 
not a traditional supervised learning problem since, in ADNI, there are no 
definitive labeled examples of MCI conversion. It is not truly unsupervised, 
either, since there are (labeled) AD and Control subjects, as well as clini- 
cal and cognitive scores for MCIs. Prior works have defined MCI subclasses 
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based on whether or not clinical/cognitive scores significantly change from 
baseline. There are serious concerns with these definitions, however, since 
e.g. most MCIs (and ADs) do not change from a baseline CDR=0.5 at any 
subsequent visit in ADNI, even while physiological changes may be occur- 
ring. These works ignore rich phenotypical information in an MCI patient's 
brain scan and labeled AD and Control examples, in defining conversion. 
We propose an innovative conversion definition, wherein an MCI patient is 
declared to be a converter if any of the patient's brain scans (at follow-up 
visits) are classified "AD" by an (accurately-designed) Control-AD classifier. 
This novel definition bootstraps the design of a second classifier, specifically 
trained to predict whether or not MCIs will convert. This second classifier 
thus predicts whether an AD-Control classifier will predict that a patient 
has AD. Our results demonstrate this new definition leads not only to much 
higher prognostic accuracy than by-CDR conversion, but also to subpopu- 
lations much more consistent with known AD brain region biomarkers. We 
also identify key prognostic region biomarkers, essential for accurately dis- 
criminating the converter and nonconverter groups. 

Keywords: Alzheimer's, mild cognitive impairment, AD conversion, MRI, 
support vector machines, feature selection, AD biomarkers, MFE, RFE 



1. Introduction 



The dementing illness Alzheimer's disease (AD), and the transitional 
state between normal aging and AD referred to as mild cognitive impair- 
ment (MCI) continue to be widely studied. Individuals diagnosed with MCI 
have memory impairment, yet without meeting dementia criteria. Annually 



10-15% of people with MCI are diagnosed with AD (Petersen, 2004). More 



over, prior to symptom onset, brain abnormalities have been found in people 
with MCI, as ascertained by retroactive evaluation of longitudinal MRI scans 



(Davatzikos et al. 2008). There is no consensus on whether MCI patients 



actually "convert" to AD. First, some MCI patients may suffer from other 
neurodegenerative disorders (e.g., Lewy body dementia, vascular dementia 
and/or frontotemporal dementia). Second, it is possible that all other MCI 
patients already have AD, but at a preclinical stage. AD diagnosis itself may 
not be considered definitive without e.g. confirming pathologies such as the 
amyloid deposits detectable at autopsy. Regardless of whether MCI patients 
truly "convert" to AD or not, the concept of MCI-to-AD conversion has been 
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widely applied, e.g. ( 


Chou et al. , 2009 


] 


Davatzikos et al. , 


2010 


Misra et al. 


2009 Vemuri et al.[ 


2009 Zhang et al. 


2011 


) and is utilitarian - denning 



MCI (converter and nonconverter) subgroups allows use of statistical group 
difference tests and machine learning methods to help identify early disease 
biomarkers and to build models for predicting disease progression. For these 
purposes, the more important question is not whether MCIs convert, but 
rather what is the best such definition. 

Accordingly, here we focus on the following Aim: automatic prognos- 
tication, (nominally) using only a baseline brain scan, of whether an MCI 
individual will convert to AD within a multi-year (three year) period follow- 
ing an initial (baseline) clinical visitQ 

While only image voxel-based features are evaluated here for use by our 
classifier, our framework is extensible to incorporating other baseline clinical 
information (e.g. weight, gender, education level, genetic information, and 
clinical cognitive scores such as the Mini Mental State Exam (MMSE)) into 
the decisionmaking. Moreover, our approach can also incorporate the recent, 
promising cerebrospinal fluid (CSF) based markers (De Meyer et al. , 2010). 
However, as this requires an invasive spinal tap, we focus here on image scans, 
which are routinely performed for subjects with MCI. 

We do not hypothesize that, within ADNI, there are actually two sub- 
classes of MCI subjects when evaluated over the very long term - those that 
(eventually) convert to AD, and those that do not. Even if an overwhelm- 
ing majority of MCI subjects will eventually convert, identifying the sub- 
group likely to convert within several years has several compelling utilities: 
1) early prognosis, to assist family planning; 2) facilitating group-targeted 
treatments/drug trials; 3) we identify key prognostic brain "biomarker" re- 
gions, i.e. those found to be most critical for accurately discriminating our 
"converter" and "nonconverter" groups. These regions may shed light on 
disease etiology. 

Distinguishing AD converters from nonconverters is a binary (two-class) 
classification problem. Moreover, it may appear this classification problem 
can be directly addressed via supervised learning methods (Duda et al. , 2001 ). 
However, it is in fact an unconventional problem, lying somewhere between su- 
pervised classification and unsupervised classification (clustering), and thus 



■""Our system performs three-year ahead prediction because it is designed based on the 
ADNI database, which followed participants for a period of up to three years. 
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requiring a unique approach. To appreciate this, consider the ADNI cohort 
of MCI individuals. ADNI consists of clinical information and image scans on 
hundreds of participants, taken at six-month intervals for up to three years. 
A clinical label (AD, MCI, or Control) was assigned to each participant at 
first visiij^} Even though a probable AD definition based on CDR and MMSE 



scores and NINCDS/ADRDA criteria has been used, e.g. (Leow et al. , 2009 



Zhang et al. |2011 ), to provide follow-up assessment for MCI patients, this is 



strictly a clinically driven definition, based on a clinical rating (CDR) and a 
cognitive score (MMSE) whose difficulties will be pointed out shortly. This 
is not a definitive (autopsy-based) determination of AD, nor is it a definition 
based on physiological brain changes. Even if the probable AD definition has 
very high specificity, it may not be sufficiently sensitive, i.e. there may be pa- 
tients who are undergoing significant physiological brain changes consistent 
with conversion, yet without clinical manifestation. 

Accordingly, we will approach the conversion problem from a perspective 
as agnostic and unbiased as possible, and simply state that it is not defini- 
tively known which MCI participants in ADNI truly converted to AD within 
three years. In conventional supervised classifier learning, one has labeled 
training examples, used for designing the classifier, and labeled test exam- 
ples, used to estimate the classifier's generalization accuracy. For predicting 
whether MCI participants in ADNI convert to AD, we in fact have neither. 
Thus, our problem is not conventional supervised learning. On the other 
hand, consider unsupervised clustering (Duda et al. , 2001). Here, even if one 
knows the number of clusters (classes) present, there is no prior knowledge 
on what is a good clustering - one is simply looking for underlying group- 
ing tendency in the data. Clearly, our problem does not fit unsupervised 
clustering, either - while we have no labeled MCI converter /nonconverter 
instances per se, 1) there are two designated classes of interest (converter 
and nonconverter); and 2) there are known class characteristics - conversion 
to AD should, plausibly, mean that: (i) a clinical measure such as CDR or 
a cognitive measure has changed and/or (ii) there are changes in brain fea- 
tures more characteristic of AD subjects than normal/healthy subjects. Note 
that in ADNI we do have plentiful labeled AD and normal/healthy (Control) 



Clinicians derive the AD/MCI/Control label based on multiple criteria, which may 
include Clinical Dementia Rating (CDR), whose possible values are: 0=none, 0.5=ques- 
tionable, l=mild, 2=moderate, 3=severe. 
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examples to help assess ii). 

Based on the above, MCI prognosis is an interesting and novel problem, 
lying somewhere between supervised and unsupervised classification. The 
crux of this problem is to craft criteria through which meaningful MCI sub- 
groups can be defined, well- capturing notions of "AD converter" and "non- 
converter". To help guide development and evaluation of candidate defini- 
tions, we state the following three desiderata: 1) The proposed definition 
of AD converter should be plausible and should exploit the available, rele- 
vant information in the ADNI database (e.g. image data, labeled AD and 
Control examples, and clinical information). To appreciate 1), note that 
the MCI population could be dichotomized in many ways, e.g. by height, 
and there might be significant clustering tendency with respect to height, 
but such a grouping is likely meaningless for MCI prognosis; 2) A classifier 
trained based on these class definitions should generalize well on test data 
(not used for training the classifier) - this quantifies how accurately we can 
discriminate the classes that we have defined. If we create what we believe 
to be good definitions, but ones that cannot be accurately discriminated, 
that would not be useful clinically; 3) The class definitions should be vali- 
dated using known AD conversion biomarkers (i.e., external measures) such 
as measured changes from baseline both in volumes of brain regions known 
to be associated with the disease (Schuff et al. 2010) and in cognitive test 
scores (such as the clinical MMSE measure). 
Prior Related Work 



Several prior works, e.g. (Davatzikos et al. 2010; Misra et al. 2009; Wang 



et al. , 2010), defined converter and nonconverter classes solely according to 



whether the baseline visit CDR score of 0.5 rose or stayed the same over 
all visits. Change in CDR has also been used as surrogate ground-truth for 



cognitive decline in a number of other papers, e.g. (Chou et al. , 2010; Vemuri 



et al., 2009). While CDR gives a workable conversion definition, it should 



be evaluated with respect to the three desiderata above. We will evaluate 2) 
and 3) in the sequel. With respect to 1), one should challenge a CDR-based 
conversion definition. First, CDR is not an effective discriminator between 
the AD and MCI groups, i.e. there is very significant AD-MCI overlap, 
not only with respect to CDR=1 but even 0.5 - for ADNI, the majority of 
the hundreds of AD subjects used in our experiments start (at first visit) 
at CDR=0.5 and stay at 0.5 at all later visits; likewise, nearly all MCI 
subjects start at 0.5, with a large majority of these also staying at 0.5 for 
all visits. This latter fact further implies difficulties in finding an adequate 
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number of conversion-by-CDR subjects in ADNI, both for accurate classifier 
training and test set evaluation. For the even more stringent probable AD 
definition (meeting MMSE and NINCDS/ADRDA criteria, in addition to 
CDR changing from 0.5 to 1) there are necessarily even fewer MCI converters 
for classifier training and testing. Second, a purely CDR-based (or "probable 
AD" based) conversion definition ignores the (rich) phenotypical information 
in an MCI subject's image brain scans and does not exploit the labeled 
AD and normal/healthy (Control) examples in ADNI. These prior works do 
treat features derived from brain scans as the covariates (the inputs) to the 
classifier/predictor. However, we believe the MCI brain scans can themselves 
be used, in conjunction with the labeled AD and Control examples, to help 
define more accurate surrogate ground-truth. 

Previous work has demonstrated that structural MRI analysis is useful 



for identifying AD biomarkers in individual brain regions (Chetelat et al. 



2002 Fan et al. 



ning (Lerch et al. 



2008 Fennema-Notestine et al. , 2009) - e.g., cortical thin 



2005 Thompson et al. 2003), ventricle dilation and gaping 



(Chou et al. 2010, 2009 Schott et al. 2005), volumetric and shape changes 



in the hippocampus and entorhinal cortex (Csernansky et al. , 2005 de Leon 



et al. , 


2006 


al. 


2004 


). It 



Stoub et al., 2005) 



2005 


de Leon 


•e ( 


Rusinek et 



al. , 2004). It is important to capture interaction effects across multiple brain 



regions. (Davatzikos et al. 2008 Misra et al., 2009 Vemuri et al., 2008 



Wang et al. , 2010) did jointly analyze voxels (or regions) spanning the entire 



brain and did build classifiers or predictors. Moreover, as part of their work, 
(Wang et al. , 2010) investigated prediction of future decline in MCI subjects 
working from baseline MRI scans, which is the primary subject of our cur- 
rent paper. However, there are several limitations of these past works. First, 
all these studies used the previously discussed CDR and cognitive measures 
such as MMSE, which has been described as noisy and unreliable, as the 



ground-truth prediction targets for classifier /regressor training. In (Chou et 



al. , 2010), the authors state: "Cognitive assessments are notoriously variable 



over time, and there is increasing evidence that neuroimaging may provide 
accurate, reproducible measures of brain atrophy." Even in (Wang et al. 



2010), where MMSE was treated as the measure of decline and the ground- 



truth regression target, the authors acknowledged that "individual cognitive 
evaluations are known to be extremely unstable and depend on a number 
of factors unrelated to... brain pathology." Such factors include sleep depri- 
vation, depression, other medical conditions, and medications. Even though 
MMSE is widely used by clinicians, these comments (even if not universally 
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accepted), do indicate MMSE by itself may not be so reliable in quantifying 



the disease state. Moreover, while (Wang et al. , 2010) did build predictors of 



future MMSE scores working from baseline scans, this was not a main focus 
of their paper - their paper focused on predicting the current score. Their 
prognostic experiments involved a very small sample size (just 26 participants 
from the ADNI database). Accordingly, it is difficult to draw definitive con- 
clusions about the accuracy of their prognostic model and their associated 
brain biomarkers. The main reason the authors chose such a small sample 
was, as the authors state: "A large part of... ADNI... are from patients who 
did not display significant cognitive decline... [these] would overwhelm the re- 
gression algorithm if.. used in the. ..experiment." While this statement (with 
cognitive decline measured according to MMSE) may be true, that does not 
mean many of those excluded ADNI subjects are not experiencing significant 
physiological brain changes/atrophy. The novel approach we next sketch is 
well-suited to identifying MCI subjects undergoing such changes. 
Our Neuroimaging-Driven, Trajectory-Based Approach 

Here, we propose a novel approach for prognosticating putative conver- 
sion to AD driven by image-based information (and exploiting AD-Control 
examples), rather than by a single, non-image-based, weakly discriminat- 
ing clinical measurement such as CDR. Our solution strategy is as follows. 
We first build an accurate image-based Control-AD classifier (i.e., using AD 
and Control subjects, we build a Support Vector Machine (SVM) classifier) 
(Vapnik, 1998). We then apply this classifier to a training population of 
MCI subjects - separately, for each subject visit, we determine whether the 
subject's image is on the AD side or the Control side of the SVM's fixed 
(hyperplane) decision boundary. In addition to a binary decision, the SVM 
gives a "score" - essentially the distance to the classifier's decision boundary. 
Thus, for each MCI subject, as a function of visits, we get an image-based 
"phenotypical" score trajectory. We fit a line to each subject's trajectory 
and extend the line to the sixth visit if the sixth visit is missing. We can 
then give the following trajectory-based conversion definition: if the extended 
line either starts on the AD side or crosses to the AD side over the six visits, 
we declare this person a "converter- by-trajectory". Otherwise, this person 
is a "nonconverter-by-trajectory"|^j In this fashion, we derive ground-truth 



3 A very small percentage of the MCI population, in our experiments often 1% and not 
exceeding 5%, may unexpectedly start on the AD side and cross to the Control side. We 
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"converter" and "nonconverter" labels for an (initially unlabeled) training 
MCI population. These (now) labeled training samples bootstrap the de- 
sign of a second SVM classifier which uses only the first-visit training set 
MCI images and is trained to predict whether or not an MCI patient is a 
"converter-by-trajectory". Essentially, this second (prognostic) classifier is 
predicting whether, within three years, an AD-Control classifier will predict 
that a patient has AD. Via these two classifier design steps, we thus build a 
classification system for our (unconventional) pattern recognition task. 

SVMs are widely used classifiers whose accuracy is attributed to their 
maximization of the "margin", i.e. the smallest distance from any training 
point to the classification boundary. Since the SVM finds a linear discrimi- 
nant function that maximizes margin, a significant change in score is generally 
needed to cross from the control side to the AD side, which is thus suggestive 
of conversion from MCI to AD. This is the premise underlying our approach. 

The main contributions of our work are: 1) a novel image-based prognos- 
ticator of MCI-to-AD conversion that we will demonstrate to achieve both 
better generalization accuracy and much higher correlation with known brain 
region biomarkers than the CDR-based approach; 2) The finding that MCI 
subgroups that are strongly correlated with known AD brain region biomark- 
ers are not so strongly correlated with "cognitive decline" as measured by 
MMSE; 3) Identification of the brain regions most critical for accurately 
discriminating between our "converter" and "nonconverter" groups, via ap- 
plication of margin-based feature selection (MFE) (Aksu et al. , 2010) to brain 
image classification, and demonstration of MFE's better performance than 



the well-known RFE method (Guyon et al. 2002) on this domain 



treat these individuals as outliers and omit them from our experiments. 
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2. Methods 



2.1. Subjects and MRI data 

We used Ti-weighted ADNI images^] that have undergone image correc- 
tion described at the ADNI websiteJ3 ADNI aims to recruit and follow 800 
research participants in the 55-90 age range: approximately 200 elderly Con- 
trols, 400 people with MCI, and 200 people with AD. The number of Control, 
MCI, and AD participants in our analysis were ~180, 300, and 120, respec- 
tively - experiment-specific detailed descriptions will be provided in Sec. [3j 
We processed the Ti-weighted images as described in Appendix A produc- 
ing new images from which we then obtained the features (next discussed) 
used by our statistical classifiers. 



2.2. Features for classification 

We chose as features the voxel intensities of a processed^ RAVENS im 
age, a type of "volumetric density" image (Davatzikos, 1998| Davatzikos et 
aT 



2001 Goldszal et al., 1998 Shen et al. 



e.g. (Davatzikos et al. 2010 



studies e.g. (Fan et al., 2007 



for voxel-based analysis (Davatzikos et al. 2001) and applied both to AD 



sue maps (gray matter (GM~ 



2003) that has been validated 



Misra et al. 2009 Wang et al. 2010 ) and other 



For each of the three processed RAVENS tis- 
white matter (WM), and ventricle), to reduce 



4 Data used in the preparation of this article were obtained from the Alzheimer's Disease 
Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu). The ADNI was launched 
in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical 
Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private 
pharmaceutical companies and non-profit organizations, as a $60 million, 5-year public- 
private partnership. The primary goal of ADNI has been to test whether serial mag- 
netic resonance imaging (MRI), positron emission tomography (PET), other biological 
markers, and clinical and neuropsychological assessment can be combined to measure the 
progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). De- 
termination of sensitive and specific markers of very early AD progression is intended to 
aid researchers and clinicians to develop new treatments and monitor their effectiveness, 
as well as lessen the time and cost of clinical trials. 

5 ADNI image correction steps include Gradwarp, N3, and scaling for gradient drift - 
see www.loni.ucla.edu/ADNI/Data/ADNI_Data.shtml. 



6 We describe our proce ssing of RAVENS imag es in Appendix A 
7 Of particular interest, (Davatzikos et al 



2001) supported that voxel-based SPM statis 



tical analysis, which we perform herein for comparison with our methods, can be performed 
on RAVENS images. 
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complexity for subsequent processing, we obtained a subsample by succes- 
sively skipping five voxels along each of the three dimensions, and took as 
feature set the union of the three subsampled maps. We will also report 
results for the case of skipping only two voxels, rather than five. 

Since high- dimensional nonlinear registration (warping) of all individuals 
to a common atlas (via HAMMER QShen et al.[ [2002] )) is applied m pro- 



ducing our features, they capture both volumetric and morphometric brain 
characteristics, which is important since individuals with AD/MCI typically 
exhibit brain atrophy (affecting both volume and shape). 

2.3. Classification and feature selection for high- dimensional images 

A challenge in building classifiers for medical images is the relative paucity 
of available training samples, compared to the huge dimensionality of the 
voxel space and, thus, to the number of parameters in the classifier model - 
in general, the number of parameters may grow at least linearly with dimen- 
sionality. In the case of 3D images, this could imply even millions of param- 
eter values (e.g. one per voxel) need to be determined, based on a training 
set of only a few hundred patient examples. In such cases, classifier over- 
fitting is likely, which can degrade generalization (test set) accuracy. Here 
we will apply a linear discriminant function (LDF) classifier with a built-in 
mechanism to avoid overfitting and with design complexity that scales well 



with increasing dimensionality - the support vector machine (SVM) (Vap 



nik| |1998 ). The choice of LDF achieving perfect separation (no classification 



errors) for a given two-class training set is not unique. The SVM, however, 
is the unique separating LDF that maximizes the margin, i. e. the minimum 
distance to the classifier decision boundary, over all training samples. In this 
sense, the SVM maximizes separation of the two classes. For an SVM, unlike 
a standard LDF, the number of model parameters is bounded by the number 
of training samples, rather than being controlled by the feature dimension- 
ality. Since the number of samples is the much smaller number for medical 
image domains, in this way the SVM greatly mitigates overfitting. SVMs 
have achieved excellent classification accuracy for numerous scientific and 



engineering domains, including medical image analysis, (Aksu et al. , 2010 



Davatzikos et al. 2010 Guyon et al. 2002) 



Even though SVMs are effective at mitigating overfitting, generalization 
accuracy may still be improved in some cases by removing features that 
contribute little discrimination power. Moreover, even if generalization ac- 
curacy monotonically improves with increasing feature dimensionality, high 
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complexity (both computation and memory storage) of both classifier design 
and class decisionmaking may outweigh small gains in accuracy achieved by 
using a huge number of features. Most importantly here, it is often useful 
to identify the critical subset of features necessary for achieving accurate 
classification - these "markers" may shed light on the underlying disease 
mechanism. In our case, this will help to identify prognostic brain regions, 
associated with MCI conversion. 

Unfortunately, there is a huge number of possible feature subsets, with 
exhaustive subset evaluation practically prohibited even for modest number 
of features, M, let alone M ~ 10 6 . Practical feature selection techniques are 
thus heuristic, with a large range of tradeoffs between accuracy and complex- 
ity (Guyon et al. , 2003). "Front-end" (or "filtering") methods select features 
prior to classifier training, based on evaluation of discrimination power for 
individual features or small feature groups. "Wrapper" methods are gen- 
erally more reliable, interspersing sequential feature selection and classifier 
design steps, with features sequentially selected to maximize the current sub- 
set's joint discrimination power. There are also embedded feature selection 
methods, e.g. for SVMs, use of £i-regularization within the SVM design opti- 
mization (Fung et al. , 2004), in order to find "sparse" weight vector solutions, 
which effectively eliminate many features. For wrappers, there is greedy for- 
ward selection, with "informative" features added, backward elimination, 
which starts from the full set and removes features, and more complex bidi- 
rectional searches. In our work, due to the high feature dimension, we focus 
on two backward elimination wrappers that afford practical complexity: i) 
the widely used recursive feature elimination algorithm (RFE) (Guyon et al.| 



2002), where at each step one removes the feature with least weight magni- 



tude in the SVM solution. RFE has been applied before to AD (Davatzikos 



et al., 2010 Misra et al., 2009 



[Wang et al. , 2010); ii) the recent margin- 
based feature elimination (MFE) algorithm (Aksu et al. |2010 ), which uses 
the same objective function (margin) for feature elimination, one consistent 



with good generalization, that the SVM uses for classifier training (Vapnik 



1998). MFE was shown in (Aksu et al., 2010) to outperform RFE (Guyon et 



al. 



2002|) and to achieve results comparable to embedded feature selection 

Here 



for domains with up to 8,000 features (gene microarray classification 
we will also find that MFE gives better results than RFE. 
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2.4- An MRI-Derived Alternative to CDR-based MCI-to-AD Conversion 

In the Introduction, we outlined our two classifier design steps for build- 
ing an automatic prognosticator for an individual with MCI. In this sec- 
tion, we elaborate on these two steps and give an illustrative example. Our 



AD-Control classifier, used in the first step, is discussed in Sec. 2.4.1 , and 



our second classifier, used to discriminate converter-by-trajectory (CT) and 



nonconverter- by-trajectory (NT) classes, is discussed in Sec. 2.4.2 



2.4-1- AD-Control classifier 

For the AD training population, we chose individual AD visit images with 
a CDR score of at least 1. For the Control training population, on the other 
hand, we only chose initial visits, and only those for participants who stayed 
at CDR=0 throughout all their visits. Thus, we excluded Controls with 
"questionable dementia" (i.e., CDR=0.5) at any visit. By these choices, we 
sought to exclude outlier examples or even possibly any mislabeled examples, 
recalling that CDR for the majority of both AD and MCI participants is 0.5 
throughout all visits. 

2-4-2. CT-NT classifier 

Fig. [T] gives an illustrative example of the phenotypical score trajecto- 
ries for MCI subject described in the Introduction. A positive score is on the 
Control side and a negative score is on the AD side - the x-axis represents the 
AD-Control SVM's decision boundary. Score vs. age is plotted, with each line 
segment a trajectory obtained by linearly fitting an individual's phenotype 
scores (and linearly extrapolating if there are missing visits). Nonconverters- 
by-CDR (N-CDR) and converters-by-CDR (C-CDR) are illustrated in (a) 
and (b), respectively. Green and black subjects are those whose fitted tra- 
jectory stayed on the Control side and AD side, respectively, whereas gray 
lines are subjects who crossed to the AD side. Thus, by our conversion-by- 
trajectory definition, the green group is the nonconverters-by-trajectory, and 
the black and gray groups together are the converters-by-trajectory. Subject 
counts for these groups are given in the figure legends. The outlier sub- 
jects are shown in orange - there are five, making up less than 2% of the 
MCI cohort. Notice, intriguingly, from the left figure that more than one 
third of all (non-orange) MCI patients (106 of 298) are converters by tra- 
jectory and yet nonconverters according to CDR - i.e., there is a very large 
percentage of patients for which the two converter definitions disagree, with 
the neuroimage-based definition indicating disease state changes that are not 
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Figure 1: AD-Control SVM score trajectories for MCI subjects, (a) Nonconverters-by- 
CDR. (b) Converters-by-CDR. 



predicted using the clinical, CDR-based definition. Likewise, an additional 
3% of all MCI subjects (11 of 298) "defy" their by-CDR converter label in 
that they do not reach the AD side of the decision boundary 

Based on these trajectories, i.e. whether or not the AD side is visited, we 
derive the ground-truth £ CT' and 'NT' labels for all MCI subjects. We then 
build a CT-NT classifier using as input only the image scans at initial visit r\ 



3. Results and Discussion 

3.1. Introductory overview 

In this section we will perform 1) classification experiments to evaluate 
conversion-by-trajectory and conversion-by-CDR with respect to desidera- 
tum 2; 2) additional experiments to compare the two definitions with re- 
spect to desideratum 3; and lastly, 3) experiments to identify prognostic 
brain "biomarker" regions. 

It is important to mitigate the potential confounding effect of the subject's 
age. In our classification experiments, we mitigated in two ways: 

1) For every classifier training, each training sample in one class was 
uniquely paired via "age-matching" with a training sample in the other class 
(with age separation at most one year). 

2) For every linear-kernel SVM classifier, we separately adjusted each 
feature for age prior to classification using linear fitting. We subtracted 
the extrapolated line (computed only using "control" samples^]) from the 
feature's value, for all (training and test) samples. As an aside, we note 
that, given the subsequent linear SVM operation, the representation power 
of this linear fitting step is essentially equivalent to simply treating age as 
an additional feature input to the linear SVM classifier. 

Finally, prior to building classifiers, we normalized feature values to the 



[0,1] range, which is suitable for the LIBSVM software (Chang et al. 2001) 
we used for training SVMs. 



8 For a small percentage of the MCI subjects, we did not obtain the patient's first visit. 
However, we did ensure that the visit we took as the "initial visit" had a CDR of 0.5. 

9 For the AD-Control classifier, these are the samples in the Control class. For the 
CT-NT classifier we computed the line using only the NT samples. 



14 



250 



48 



□ N-CDR 
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□ C-CDR- NT 

□ C-CDR-CT 
■ N-CDR-CT 

□ N-CDR-NT 



37 



106 



(a) By-CDR. (b) By-trajectory. (c) Overlap. 

Figure 2: A population of 298 MCI subjects in ADNI is shown here, broken up according to 



the two criteria discussed in Sec. 3.2.1 (a) by-CDR criterion, (b) by-trajectory criterion; 
with overlap shown in (c). 



3.2. Experiments with voxel-based features 

The test set (generalization) accuracy of the voxel-based AD-Control clas- 
sifier, built using 70 training samples per class, was 0.89 (86 of 88 Controls, 
and 16 of 27 AD subjects, were correctly classified.) This classifier, with high 
specificity for Controls, was then applied to a population of MCI subjects to 
determine the CT and NT subgroups. 

3.2.1. Classification experiments for the MCI population 



Fig. 2(a) shows the sizes of the converter-by-CDR (C-CDR) and nonconverter- 



by-CDR (N-CDR) groups within the ADNI MCI cohort for a typical exper- 



iment in our work. Fig. 2(b) shows the same population broken up as 



converters-by-trajectory (CT) and nonconverters-by-trajectory (NT). Super- 



imposing the two charts, Fig. [2(c) illustrates their overlap, where converters 
by both definitions are accordingly indicated by orange. Since converters- 
by-CDR are relatively scarce, we used a large majority of them (80%, i.e. 39 
individuals among the 48) for the by-CDR classifier's training set, with the 
rest (20%) put into the test set. We reiterate that a general disadvantage of 
the by-CDR approach is its scarcity of converter examples - by contrast, a 
more balanced number of examples is available for by-trajectory training (at 



least 100, rather than 39, training samples per class, as in Fig. 2(b)). Note 
also that if we were to use a "probable AD" , rather than a by-CDR converter 
definition, there would be even fewer converter examples. 

A fair performance comparison between by-trajectory and by-CDR clas- 
sification requires: 1) using the same per-class training set size (i.e. 39) for 
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both by-CDR and by-trajectory training, and 2) making the test set sizes 
the same for both classifiers. There are several different ways in which the 
data can be partitioned into training and test sets, consistent with these two 
conditions: i) we can perform simple random selection on a class-by-class ba- 
sis, ensuring only that the two classifiers are given the same training/test set 
sizes (but not the same sets) J^jii) we can make the training sets of the two 
classifiers identical rather than merely same-sized, as well as make the test 
sets identical. This latter approach, though, will have some bias because, in 
selecting samples for the by-trajectory classifier, we will have to make use of 
knowledge of the samples' conversion-by-CDR status (and vice versa for the 
by-CDR classifier). The first approach, on the other hand, clearly does not 
have this bias. As both approaches are valid ways of dealing with by-CDR 
data limitations, we will compare generalization accuracies of by-CDR and 
by-trajectory classifiers under both these data selection schemes, referring to 
these approaches as "random" and "identical" in the sequel. 

Our training/test set selection procedure for the "identical approach" 
is as follows. For the C-CDR-CT group (Fig. 



2(c)), 



randomly select 80% 
such that 
2 fa")]) can be 



3(a)) 



of the group (the yellow striped group of size 30 in Fig. 
a corresponding group within N-CDR (white portion in Fig. 
found that is both NT and satisfies age-matching. This corresponding group 
is illustrated in Fig. |3(a)| as the white striped group (of size 30), placed 
opposite from the yellow striped area it is paired (matched) with. Likewise 
for the C-CDR-NT group (Fig. 2(c) ), randomly select 80% of the group (the 
yellow striped group of size 9 in Fig. 3(a) ) such that a corresponding group 
within N-CDR can be found that is both CTand satisfies age-matching. This 
corresponding group is illustrated in Fig. 3(a)| as the white striped group (of 
size 9). Notice by comparing this figure to Fig. 2(c) that the two white 
striped separated by the CT-NT border. We take the training set 

- shared by the by-CDR and by-trajectory classifiers - to be precisely the 
union of these four striped areasj^] Subsequently, we take the test set - 



10 Note that this means that the training sets for the converter and nonconverter classes 
of the conversion-by-CDR classifier are randomly selected from the yellow and white re- 
gions in Fig. 2(a) respectively, with no consideration of trajectory-based (i.e. red/white) 
labeling illustrated in Fig. 



2(b) 



11 For the by-CDR classifier, the class membership of any of these four subsets of the 

Likewise, for the 



training set is illustrated by the color being yellow or white in Fig. 3(a 



by-trajectory classifier, class membership is illustrated by red or white color in Fig. 3(b) 
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shared by the two classifiers - to be the subjects who are neither in 1) the 
training set (striped areas) nor in 2) the special set of subjects shown in solid 
(also shown identically in Fig. 



gray in Fig. 3 (a 



is the tiled areas in Fig. 3(a) (or, identically, in Fig. 3(b)) 



3(b)) That is, the test set 



Note above that some random selection is being employed in choosing the 
training/ test sets even in the "identical approach" (whereas, in the "random 
approach" the selection is completely random). Thus, for both approaches, 
the accuracy of performance comparison will benefit from averaging accuracy 
results over multiple training/test split "trials". Results averaged across 10 
trials are given in Fig. 3(c) for a linear-kernel SVM 13 \l ± a notation is used 



to indicate the mean fx and standard deviation a of quantities across the 
trials]^] Note that by-trajectory's generalization performance is as high as 
0.83, whereas by-CDR's generalization performance is very poor - as poor 
as random guessing (see 0.5 and 0.56 table values) - due mainly to poor 
performance on nonconverters-by-CDR. Fig. 3(d) and Fig. |3(e) show by- 



trajectory and by-CDR results, respectively, for one of the 10 trials (for the 
"random approach"), with each bar indicating distance to the classification 
boundary^ for an MCI subject in a test population of size 88 and non- 
converters/converters shown in left/right figures, respectively. Among the 
88 subjects, by-trajectory correctly classified 79 whereas by-CDR correctly 
classified only 40. 

Recently, similarly poor by-CDR classification performance was also re- 
ported in (Davatzikos et al. 2010), where it was found that the majority 



of (by-CDR) nonconverters "had sharply positive SPARE-AD scores indi- 
cating significant atrophy similar to AD patients". Since the SPARE- AD 
score is produced by a classifier that was trained to discriminate Control and 
AD patients (Fan et al. 2007, 2008), this comment and associated results 



are consistent both with our conjecture in the Introduction and our above 



12 We exclude this "special set" (in gray) from the test set so that all our experiments 
under the "identical approach" can have a shared, fixed test set (for fair comparison with 
each other), including, crucially, an experiment that will include this "special set" of 
samples in the training set. 

13 For generating all classification results herein, including those in Fig. 



3(c) we used 



SVM classifiers that were built by employing the common approach of bootstrap-based 
validation for selecting the classifier's (trial's) hyperparameter values (Aksu et al. 2010). 

14 \x and a of integer quantities (e.g. sample counts) are shown rounded up. 

15 Positive/negative distance means nonconverter/converter side of the boundary, re- 
spectively. 
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histogram results, which suggest that there may be a significant number of 
patients undergoing physiological brain changes consistent with conversion, 
yet without clinical manifestation. 

The results above indicate that the conversion-by-CDR definition's two 
classes are not well-discriminated, and thus, clinical usefulness of this def- 
inition for our prognostic Aim is expected to be poor. The much greater 
generalization accuracy of the by-trajectory definition (coupled with its in- 
herent plausibility as a conversion definition) indicates its greater utility. 
Increasing the By-Trajectory Image-Based Feature Resolution: 

In a separate experiment, we evaluated using one of 27 subsamples (rather 
than one of 216 subsamples), i.e. essentially a 10-fold increase in the number 
of (voxel-based) features, and found that the by-trajectory generalization ac- 
curacy rose to 0.91 in the "random" case. We then tried building 27 separate 
by-converter classifiers, one for every l/27th subsample (thus effectively us- 
ing the whole 3D image), with majority-based voting used to combine the 
27 decisions. This ensemble scheme again achieved 0.91 accuracy, i.e. there 
was no further accuracy benefit beyond that from an rs 10-fold increase in 
the number of voxel features. 
Increasing the By-Trajectory Training Set Size 

Note that the converter-by-CDR sample scarcity and class-balancing (via 
age-matching) in the experiments above had the effect of artificially limiting 
the by-trajectory classifier training set size. Next we investigated how much 
the generalization accuracy of by-trajectory classification improves when this 
limitation is removed. The tiled areas in Figures gajandgb]] are identical, 



illustrating that in this new experiment (Fig. 4(a) ) we used the same test set 
as previously, for fairness of comparison. However, as indicated by differences 
in the total striped area between these two charts, we now make the training 
set much larger than previously. Specifically, for the "identical" case, we used 
the previous 10 trials but simply augmented a trial's training set with the 
two large, previously-excluded gray sets^ as these two sets do age-match 
each other. The results, averaged across the 10 trials, are given in Fig. 
4(b)| Notice in this figure the now larger per-class training size (on average 
100 rather than 39), and that the random approach uses this size as 



well. The by-trajectory results in Fig. 4(b) indicate that accuracy improved 



from 0.78 ± 0.02 (Fig. 3(c)) to 0.84 ± 0.02 for the "identical approach", 



Shown in Fig. 3(b) with size 60. Note that this size can vary from trial to trial. 
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(a) Training/test set selec- (b) Training/test set selec- 
tion for by-CDR classifica- tion for by-trajectory clas- 
tion. sification. 



Sample 
selection 


Classifier 


Test set 


Converters 


Nonconverters 


Overall 
accuracy 


Count 


Accuracy 


Count 


Accuracy 


Random 
approach 


By-trajectory 


46 ±1 


0.81 ± 0.06 


45 ±1 


0.84 ± 0.05 


0.83 ± 0.05 


By-CDR 


9±1 


0.79 ±0.07 


82 ± 2 


0.47 ± 0.04 


0.50 ±0.04 


Identical 
approach 


By-trajectory 


45 ± 1 


0.78 ± 0.04 


47 ±2 


0.78 ± 0.04 


0.78 ± 0.02 


By-CDR 


9±1 


0.74 ±0.11 


82 ± 2 


0.54 ± 0.04 


0.56 ±0.04 



(c) Average test set classification accuracy using all f 1, 293 features, 39±1 per-class 
training samples. 



44 test samples 44 test samples 
[positives: 41 , nonpositives: 3] [positives: 6, nonpositives: 38] 79 test sam P les 9 1es1 samples 

j c t -j 5 , [positives: 33, nonpositives: 46] [positives: 2, nonpositives: 7] 




sample index sample index 50 123456789 

sample index sample index 

(d) By-trajectory. Left: nonconverters; (e) By-CDR. Left: nonconverters; Right: 
Right: converters converters 

Figure 3: Test set accuracy comparison of by-CDR and by-trajectory classification. 
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(a) Training/test set selection for 
by-trajectory classification. 





Sample 
selection 


Test set 


Converters 


Nonconverters 


Overall 
accuracy 


Count 


Accuracy 


Count 


Accuracy 


By-trajectory 


Random 


45 ±1 


0.77 ± 0.01 


47 ± 2 


0.86 ±0.02 


0.82 ±0.01 


Identical 


45 ±1 


0.85 ±0.03 


47 ± 2 


0.83 ±0.05 


0.84 ± 0.02 



(b) 

Figure 4: By-trajectory average test set classification accuracy for the larger training set 
size (98 ± 1 per class) and 11, 293 features. 



and modestly worsened for the "random approach" (from 0.83 ± 0.05 to 
0.82 ±0.01). 

Evaluating a Third Conversion Definition 

We now consider a third definition of conversion that combines the first 
two definitions as follows. Let "converters" consist of individuals who con- 



verted either by-trajectory or by-CDR (non-white areas in Fig. 2(c)), with 
the "nonconverter" class consisting of the remaining MCI individuals (white 
area). The point of view of this new definition, "conversion-by-union", is to 
be more inclusive in defining an MCI subpopulation at risk, which may ben- 
efit from early treatment or diagnostic testing. While from that perspective 
the new definition is reasonable, the fact that grouping individuals by CDR 
has a role in this definition may be its disadvantage, considering that by-CDR 
classification was previously shown to perform not much better than random 
guessing. Results, averaged across the same 10 trials used in Fig. [3| are 
given in Table [T] and indicate that conversion-by-union generalizes somewhat 
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Sample 
selection 


Test set 


Converters 


Nonconverters 


Overall 
accuracy 


Count 


Accuracy 


Count 


Accuracy 


By-union 


Identical 


48 ± 1 


0.78 ±0.05 


44 ±2 


0.75 ± 0.06 


0.77 ±0.03 



Table 1: Average test set accuracy of by-union classification for 39 ± 1 per-class training 
samples and 11,293 features. 



worse than conversion-by-trajectoryp] 

3.2.2. Validation on Known AD Conversion Biomarkers 

To validate the proposed conversion definitions with respect to desider- 
atum 3, we performed correlation tests on the MCI population between the 
binary class variable C G {0 = no conversion, 1 = conversion} and known 
AD conversion biomarkers consisting of both 1) volume in reported AD- 



affected regions (Table 2 in (Schuff et al. 2010)), which we measured for 



each individual's final-visit MR] 18 and 2) the clinical MMSE measure. The 
stronger the correlation, the more accurately the biomarker is predicted from 
the class variable and the greater the separation between the biomarker his- 
tograms, conditioned on the two classes. 

Before presenting correlation test results, we first illustrate in Fig. [5] 
the increased separation of the histograms of hippocampus volume for the 
converter and nonconverter groups in the by-trajectory case, compared with 
by-CDR. Next, we performed comprehensive statistical tests for a number of 
suggested AD biomarkers. The R statistical computing package was used to 
perform all tests with statistical significance set at the 0.05 level. In Table [2^i 



17 Notc that "by-union" is, by definition, an instance of the "identical approach", as 
stated in Table [T] To ensure fairness of comparison with the by-trajectory definition of 
conversion, our test sets, and training set sizes, in these two cases were identical. In fact, 
we chose the by-union training set to be as similar to by-trajectory's, in every trial, as 
possible. Referring to Fig. |3(b)| (which represents a trial example), the by- union training 
set was chosen to include 1) the two large striped groups (red and white); 2) the small 
"special" gray group (of size seven in this trial example) and its age-matched counterpart 
within small white-striped group, and; 3) a subset of the second small "special" gray group 
(two of five individuals in this trial example) and its age-matched counterpart within the 
small red-striped group. 

1 As discussed in Appendix A, we measured normalized region volume. Note also that 
our regions are defined based on the atlas (Atlas2) we used. The correspondence between 



the regions in (Schuff et al. 2010) and our defined regions is given in Appendix B. Finally, 



note that a subject's final visit is not always the sixth visit. 
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normalized volume of hippocampus x 1Q -3 

Figure 5: For the hippocampus, by-trajectory (red) has larger histogram separation be- 
tween converter (dashed line) and nonconverter (solid line) groups than by-CDR (blue). 
To illustrate this more clearly, also shown is the Gaussian curve for each of these four 
subject groups (plotted based on group mean (m) and standard deviation (s) indicated in 
the figure legend with the same 0.001 scaling as the x-axis). 
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(and Fig. [6J, the correlation coefficients for by-trajectory and by-CDR are 



shown for each biomarker, along with their associated p- values (Chambers 



et al. , 1993). Note that for 11 out of 14 brain regions, the correlation with 
by-trajectory is greater than the correlation with by-CDR (in bold), with 
by-trajectory meeting the significance threshold in 10 of these 11 regions. 
Further, for only two of the remaining four biomarkers - posterior cingulate 
and the clinical MMSE measure - does the correlation with by-CDR meet the 
significance threshold. Most notably, well-established markers for AD such 
as the hippocampus, lateral ventricles, and inferior parietal exhibited strong 
correlation with the by-trajectory definition. 

To further assess statistical significance of the comparison between by- 
trajectory and by-CDR correlations, we performed a correlated correlation 



test (Meng et al. , 1992), the appropriate test given that the same MCI sam- 
ple population was used in measuring correlations for both by-CDR and 
by-trajectory. This test (Table [2|d) reveals that the larger correlation of by- 
trajectory is statistically significant at the 0.05 level in nine brain regions (in 
bold). By contrast, conversion-by-CDR does not achieve a statistically sig- 
nificant advantage for any of the brain regions, nor with respect to MMSE. 
To mitigate the confounding effect of age in the AD biomarker measurements, 
we linearly adjusted them for age prior to performing the experiments de- 
scribed above. We also repeated these experiments without age adjustment. 
In this case, the number of regions for which by-trajectory's correlation ex- 
ceeded by-CDR's rose from 11 to 12 (with the addition of the entorhinal 
cortex), and the number of regions for which the larger correlation of by- 
trajectory was statistically significant rose from 9 to 10 (with the addition of 
the medial orbitofrontal GM). 

3.2.3. Identification of prognostic brain "biomarker" regions 

In the previous section, we validated conversion definitions using estab- 
lished (diagnostic) AD biomarker brain regions (with volumes measured at 
final visit). In this section, we will identify key prognostic biomarker brain 
regions via supervised feature selection, aiming first to identify the "essen- 
tial" subset of voxel features, i.e. the voxels (at initial visit) necessary for 
our classifier to well-discriminate the CT and NT classes. The brain regions 
(consistent with a registered brain atlas) within which these select voxels 
principally reside then identify our prognostic brain biomarker regions. Sim- 
ilarly, we will identify diagnostic regions, critical for discriminating between 
AD and Control subjects (using our AD-Control classifier). In both cases, 
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(a) 





By-trajectory 


By-CDR 




Correlation 


P-value 


Correlation 


P-value 


Biomarker 


coefficient 




coefficient 




Entorhinal cortex 


0.041 


0.486 


0.050 


0.385 


Fusiform gyrus 


0.430 


8.20E-15 


0.152 


0.008 


Hippocampus 


0.530 


2.00E-16 


0.231 


5.67E-05 


Inferior parietal GM 


0.513 


2.00E-16 


0.097 


0.094 


Lateral orbitofrontal GM 


0.077 


0.185 


0.078 


0.178 


Lateral ventricles 


0.586 


2.00E-16 


0.211 


0.000246 


Medial orbitofrontal GM 


0.239 


3.01E-05 


0.101 


0.081 


Parahippocampal gyrus 


0.095 


0.102 


0.060 


0.307 


Posterior cingulatc 


0.001 


0.981 


0.144 


0.013 


Precentral GM 


0.173 


0.003 


0.088 


0.128 


Superior frontal GM 


0.331 


4.50E-09 


0.094 


0.104 


Superior temporal GM 


0.478 


2.00E-16 


0.161 


0.005 


Total GM 


0.491 


2.00E-16 


0.246 


1.70E-05 


Total WM 


0.170 


0.003 


0.041 


0.477 


MMSE 


0.356 


2.57E-10 


0.453 


2.00E-16 



(b) 

P-value 



0.889 
2.17E-06 
2.42E-10 
0.985 
1.34E-09 
0.048 
0.616 
0.045 
0.230 
0.001 
3.10E-05 
0.0001 
0.069 
1.33E-06 
0.118 



Table 2: Correlation coefficients and associated p-values: (a) Correlation test results; (b) 
Correlated correlation test results for each of the regions in (a). Statistically significant 
results are shown in bold. 
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Figure 6: Comparison of correlation coefficients for by-trajectory and by-CDR. 



AD-Control classifier only 


intersection 


CT-NT classifier only 


Amygdala left 
Cingulate region right 
Entorhinal cortex right 
Inferior occipital gyrus right 
Medial occipitotemporal gyrus left 
Parahippocampal gyrus left 
Temporal lobe WM right 
Temporal pole right 


Hippocampal formation* left 
Hippocampal formation* right 
Entorhinal cortex* left 
Inferior temporal gyrus right 
Lateral occipitotemporal gyrus* right 
Parahippocampal gyrus* right 
Perirhinal cortex left 
Perirhinal cortex right 
Middle temporal gyrus right 
Uncus left 


Superior temporal gyrus* left 
Middle temporal gyrus left 
Precuneus right 

Lateral front-orbital gyrus* right 
Insula right 

Supramarginal gyrus* left 
Temporal lobe WM left 
Temporal pole left 
Medial front-orbital gyrus* left 



Table 3: Brain regions identified as biomarkers using voxel-based features and MFE. 



25 



the accuracy of the selected brain region biomarkers rests heavily on the ac- 
curacy of the supervised feature selection algorithm we employ. In Figures 
7(a)1 and [7(b)j we compare MFE and RFE feature elimination (i.e. feature 



selection via feature elimination) for both Control-AD classification and for 
CT-NT classification (for one representative, example trial). The curves show 
test set accuracy as a function of the number of retained features (which is 
reduced going from right to left). Note that the "MFE/MFE-slack" hybrid 
method (Aksu et al. 2010)) outperforms RFE for both brain classification 
tasks, achieving lower test set error rates, and with much fewer retained fea- 
tures. The circle, determined without use of the test set based on the rule in 
(Aksu et al. , 2010), marks the point at which we stopped eliminating features 
by MFE, thus determining the (trial's) retained voxel set. This MFE-RFE 
comparison (and the previous comparison in (Aksu et al. 2010[ )) supports 
our use of MFE to determine brain biomarkers. 

To relate the retained voxel set to anatomic regions in the brain, we 
overlaid the retained voxel set onto a registered atlas space. For CT-NT 
classification, to improve robustness, the final voxel set was formed from 
the union of the retained voxel sets from each of ten feature elimination 
trials (each using a different, randomly selected training sample subset). For 
AD-Control classification, the final voxel set came from a single trial (the 
only trial, from which the 10 CT-NT trials stemmed). For each of these 
two cases, overlaying the final voxel set onto the co-registered atlas (Atlas2, 
defined in Appendix A) yielded between 70-80 anatomic regions. For data 
interpretation purposes, we then identified a subset of (biomarker) regions 
using the following procedure. First, for each brain region, we measured the 
percentage of the region's voxels that are retained, sorted these percentages, 



and then plotted them. As shown in Fig. 8(a) , the resulting curve for the 
AD-Control case has a distinct knee, which we thus used as a threshold 
(0.125) to select the final, retained (diagnostic) regions for AD-Control. We 



used the same threshold for the CT-NT curve, shown in Fig. 8(b) This 
choice of threshold yields a reasonable number of regions - 19 for the CT-NT 
(prognostic) case and 21 for the AD-Control (diagnostic) case. 

The resulting sets of identified prognostic and diagnostic biomarkers are 
given in Table |3j along with their intersection. The diagnostic markers in the 
table include the majority of the known brain regions in the medial tempo- 
ral lobe involved in AD pathology. For example, hippocampus atrophy and 
lateral ventricle enlargement, particularly in its anterior aspects of the tem- 
poral horn, are considered the most prominent diagnostic markers for AD. 
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Entorhinal cortical regions, including the perirhinal cortex, are presumably 
the earliest sites of degeneration (Braak et al. 1997). Thus, independent 



identification by our AD-control classifier of known AD diagnostic biomark- 
ers establishes a reasonable basis for applying the same approach to identify 
prognostic biomarkers. The brain regions listed as CT-NT prognostic mark- 
ers include most known AD diagnostic markers (including 8 of the 12 regions 
from (Schuff et al. 2010[ ) (marked by *), 4 of which are also diagnostic mark- 
ers), indicating that some AD-linked pathological changes in these brain re- 
gions already occurred and remained active in a subset of MCI subjects who 
likely progress to AD rapidly. Conversely, the brain areas appearing only on 
the prognostic marker list are likely the most active areas of degeneration 
during this stage of progression to dementia. These structures tend to be 
the brain regions further away from the entorhinal cortex onto the parietal 
(Supramarginal gyrus, Precuneus) and temporal cortex (Superior temporal 
gyrus and Middle temporal gyrus) regions. All the brain structures listed 



in the table are known to be involved in AD (Braak et al. 1997 Chan et 



al. , 2003 Frisoni et al. , 2009). Thus, the markers in Table 3 suggest an 



interesting anatomic pattern of trajectory for MCI conversion to AD which 
conforms with the Brak and Brak hypothesis and previous imaging findings 
(IChan et al. I 120031 iFrisoni et all 120091). Moreover, the CT-NT regions 



uniquely found by our MFE-based procedure in Table [3] may be viewed as 
"putative" prognostic markers, and may warrant further investigation. 

Finally, we note that we have used a particular criterion (percentage of a 
region's voxels that are retained) to identify biomarker regions, starting from 
MFE-retained voxels. While our identified regions are plausible, it is possible 
that other (equally reasonable) criteria may produce different biomarker re- 
gion results. Thus, the biomarkers we identify should be viewed as anecdotal, 
identifying regions that figure prominently in our classifier's decisionmaking 
and also potentially assisting researchers in forming hypotheses about MCI- 
to-AD disease progression. However, we do not view the identified regions 
as definitive. 



3.2.4- Comparison with an SPM-based biomarker identification approach 

In the previous section, we used MFE to identify voxels as biomarkers 
for the CT and NT classes. Here, using the same CT-NT training and 
test populations, we will alternatively identify voxel-based biomarkers us- 
ing statistical testing with SPM5 (see: ( SPM website[ )). Subsequently we 
will present a classifier generalization accuracy comparison (where accuracy 
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Figure 7: Test set misclassification rate during the course of feature elimination, for (a) 
the AD-Control classifier and (b) the CT-NT classifier. 





(a) 



(b) 



Figure 8: Sorted retained voxel percentages for initial regions (a) AD-Control; b) CT-NT, 
used to select final regions (Sec. 3.2.3 1. 
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is again measured on the previous section's CT-NT (test) population) for 
these two biomarker detection methods. We determined SPM biomarkers 
as follows. The CT-NT training set population, being age-matched, is read- 
ily suitable for a paired t-test, an appropriate statistical test for determining 
SPM-identified biomarkers, i.e. voxels that discriminate between the CT and 
NT groups. In contrast with MFE's use of only one (out of 216) RAVENS 
subsamples (taken jointly from the GM, WM, ventricle maps), we performed 
t-tests on whole RAVENS maps (without subsampling), which makes the 
SPM-MFE comparison favorably biased towards SPM. More specifically our 
steps were as follows. First, for the GM and WM maps separately, we found 
using SPM that a large portion of each of these two tissues was statistically 
significant at the 0.05 level when correction for multiple comparisons was 
not applied. Next, we used SPM's FDR-based correction for multiple com- 
parisons - based on an SPM FDR cluster size of 5 voxels we found that the 
spatial extent of the statistically significant regions, at each of the levels 0.05, 
0.01, and 0.005, was approximately a subset of the above-mentioned spatial 
support found in the uncorrected case. Given that the number of significant 
voxels in any of these SPM experiments is, again due to no subsampling, 
much larger than the 11,293 voxels started from in the MFE case, we sim- 
ply 1) chose as our SPM result the result for 0.01 (FDR-corrected), 2) took 
from among those significant voxels the most significant 11,434 voxels in or- 
der to be able to compare MFE and SPM for f» the same number of voxels 
(biomarkers). To obtain the generalization accuracy for this SPM-identified 
biomarker voxel set, using the same training/test set as in the MFE experi- 
ment, we trained an SVM classifier and measured its generalization accuracy, 
which was found to be 0.76. This accuracy is somewhat lower than the 0.8 
accuracy of the previous section's CT-NT SVM classifier. Recalling that this 
comparison is actually favorably biased towards SPM, and further noticing 
the fact that MFE was able to maintain the 0.8 accuracy all the way down to 
2000 features (cf. Fig. |7(b) ), this experimental comparison provides another 
validation (beyond the comparison with RFE given earlier) for MFE-based 
feature/biomarker selection, applied to brain images. 



4. Conclusions 

We have presented an automated prognosticator of MCI-to-AD conver- 
sion based on brain morphometry derived from high resolution ADNI MR 
images. The primary novel contributions of our work are: i) casting MCI 
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prognostication as a novel machine learning problem lying somewhere be- 
tween supervised and unsupervised learning; ii) our proposal of a conversion 
definition which, unlike previous methods, exploits both rich phenotypical 
information in neuroimages and AD and control examples; iii) correlation 
testing and classifier accuracy evaluations to validate candidate conversion 
definitions; iv) prognostic biomarker discovery based on our conversion defi- 
nition. We demonstrated that our method achieved both better generaliza- 
tion accuracy and stronger, statistically significant, correlations with known 
brain region biomarkers than a predictor based on the clinical CDR score, 
the approach used in several past works. The brain structures identified 
as AD-control diagnostic markers and MCI conversion prognostic markers 
well conform with known brain atrophic patterns and progression trajecto- 
ries occurring in AD-afflicted brains. While the noisy nature of cognitive 
assessments, including MMSE, has been acknowledged in past works, in fu- 
ture we may extend our methodology to consider cognitive assessment data, 
both potentially as additional (baseline) input features and as additional or 
alternative prediction targets to our "conversion-by-trajectory" labels. We 
may also consider alternative ways to adjust for confounding effects of age, 
noting that (Schuff et al. 2010) has characterized the nonlinear dependence 
of age on brain region volumes. Finally, while we have focused on the MCI 
subpopulation here, our system could also potentially be used to detect, as 
possible misdiagnoses, subjects diagnosed as "Control" who are classified as 
MCI converters by our system. 
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Appendix A. Image processing 



The input to our processing is a Ti-weighted MR image of the head. 
First, using rigid-body registration implemented in FSL's linear image reg- 



istration tool FLIRT (Jenkinson et al. , 2001), we coarsely aligned this (3d) 
image with the MNI/ICBM atlas resampled to 1mm isotropic voxel dimen- 
sions (aka Atlasl). We used FSL version 3 (see: (FSL website)). Next, 
we removed non-brain anatomy from the aligned head image, using FSL's 
brain extraction tool BET (Smith et al. , 2002). We then segmented the 



resulting brain-only 3d image into the following five segments required by 
HAMMERflShen et al.[ |2002p1 : WM, GM, ventricles, (non-ventricle) CSF, 



(non-anatomy) background. Next, using as inputs an MNI atlas distributed 



with HAMMER (aka AtlasS ) and the ADNI participant's five-segment im- 



age, we performed two different HAMMER operations, generating: 1) the 
three "volumetric density" Ravens images: GM, WM, ventricles. The union 
of these three images (following some preprocessing) forms the set of features 
used by our classifier; 2) the 3d region-segmented image, whose region vol- 
umes are used for by-trajectory statistical validation in section 3.2.2 and for 
region biomarker identification in section 3.2.3 ^ 



19 We used a 2006 version of HAMMER, downloaded on November 8, 2006. 

20 Since the region-segmented version of the atlas distributed with HAMMER had much 
better segmentation quality than the five-segment version, as a pre-processing step we 
used the former to create a replacement for the latter (by simply taking the union of all 
the regions). The resulting five-segment atlas is dubbed 'Atlas2'. 

21 We measured normalized region volumes. We normalized by dividing by the sum of all 
(98) region volumes. This sum is essentially intracranial volume minus cerebellum volume, 
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Each of these two HAMMER operations performs a different type of atlas- 
based nonlinear registration - the one that generates the RAVENS tissue 
maps performs less aggressive warping between the five-segment image and 
the atlas in order both to combat noise inherent in the registration process 
and to preserve volume on a tissue-by-tissue basis (so as to properly detect 
and encode brain atrophy). Consequently, there is considerable individual 
variability in RAVENS images, which we mitigate in a standard way by 
smoothing with a Gaussian filter with an FWHM of 5mm. 

The sum of the voxel intensities over the three RAVENS maps is on the 
order of 10 6 and varies across individuals. Thus, prior to smoothing, we 
normalized each individual's Ravens maps, such that each individual has the 
same total volume. However, after the normalization and smoothing (aka 
"nsRAVENS" images), some areas of poor registration, manifesting 



with very low voxel values (including zero), will remain, 22 For a population 
of nsRAVENS images, these areas are considered to be outlier areas and are 
thus removed from each image in the population by thresholding P} The 
resulting images are the features used by our classification system. 

Appendix B. Correspondence Between Atlas-defined Regions and 
Those Defined in (Schuff et al., 2010) 



as our "intracranial region" list includes CSF in addition to brain regions and excludes 
the cerebellum. 

22 A substantial portion of the very low, nonzero voxel intensities are in fact introduced 
by the smoothing itself as it calculates each new voxel intensity as a weighted average of 
many neighboring voxels (thereby switching some voxels from zero (non-anatomy) to very 
low intensity values (anatomy), which essentially slightly grows the anatomy boundaries 
outward) . 

23 We calculated the threshold solely using the training population of our AD-Control 
classifier, and then applied the thresholding operation on the entire population of AD, 
MCI, and Control individuals. In this way, we were careful to exclude test examples from 
all phases of classifier training. 
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Entorhinal cortex 


Entorhinal cortex left /right 


Fusiform gyrus 


Lateral occipitotemporal gyrus right/left 


Hippocampus 


Hippocampal formation right/left 


Inferior parietal GM 


Supramarginal gyrus left /right, Angular gyrus right /left 


Lateral orbitofrontal GM 


Lateral front-orbital gyrus right/left 


Lateral ventricles 


Lateral ventricle left/right 


Medial orbitofrontal GM 


Medial front-orbital gyrus right/left 


Parahippocampal gyrus 


Parahippocampal gyrus right/left 


Posterior cingulatc 


Cingulate region left /right 


Precentral GM 


Precentral gyrus right/left 


Superior frontal GM 


Superior frontal gyrus left/right 


Superior temporal GM 


Superior temporal gyrus right/left 



Table B.4: Correspondence between the regions in (Schuff et al. 2010[ ) (left) (except "Total 
GM" and "Total WM") and our denned regions (right). 
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